The goal of this project is to investigate how partnerships involving
multiple top-tier NBA players affect various performance measures and
team outcomes. Among the research questions we would like to explore
are the following:
[ INSERT RESEARCH QUESTIONS HERE ]
To investigate these questions, we need to pull data from multiple NBA
seasons. The script below defines a function that pulls traditional
per-game stats for every player in a user-specified season.
Load necessary libraries
library(rvest)
library(dplyr)
library(tidyverse)
Function to get NBA roster for a specified year
get_nba_roster <- function(year) {
# Construct the URL for the specified year
url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_per_game.html")
# Read the HTML content from the URL
webpage <- read_html(url)
# Extract the table containing the player statistics
roster_table <- webpage %>%
html_node("table#per_game_stats") %>%
html_table(fill = TRUE)
# Clean the data (remove header rows that might be duplicated)
roster_table <- roster_table %>%
filter(Player != "Player")
return(roster_table)
}
Example usage
year <- 2018 # Specify the year
nba_roster <- get_nba_roster(year)
#Print the first few rows of the roster
head(nba_roster)
tail(nba_roster)
# Players at every position except point guard
position_roster <- filter(nba_roster, Pos != "PG")
position_roster
library(plotly)
# Assuming 'nba_roster' is your data frame
input <- nba_roster[, c('MP', 'PTS', 'Player','Pos')]
# Create the plotly scatter plot
fig <- plot_ly(input, x = ~MP, y = ~PTS, type = 'scatter', mode = 'markers',
text = ~Player, # This adds player names on hover
hoverinfo = 'text', # Ensures that only player names appear on hover
color = ~Pos, # Colors points based on position
marker = list(size = 10))
# Set the plot title and axis labels
fig <- fig %>% layout(title = "Minutes Played vs Points Scored",
xaxis = list(title = "Minutes Played", range = c(0, 48)),
yaxis = list(title = "Points", range = c(0, 35)))
# Show the plot
fig
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
# Data visualization for minutes played vs points scored
input <- nba_roster[, c('MP', 'PTS')]
print(head(input))
# Plot points against minutes played, with the
# minutes axis limited to 0-48 and the
# points axis limited to 0-35.
plot(x = input$MP, y = input$PTS,
xlab = "Minutes Played",
ylab = "Points",
xlim = c(0.0, 48),
ylim = c(0.0, 35),
main = "Minutes Played vs Points Scored"
)

# Data visualization for field goals attempted vs field goals made
input_2 <- nba_roster[, c('FGA', 'FG')]
# Make sure both columns are numeric before computing limits and plotting
input_2$FGA <- as.numeric(input_2$FGA)
input_2$FG <- as.numeric(input_2$FG)
print(head(input_2))
# Use the largest observed values as the upper axis limits
b_FG <- max(input_2$FG, na.rm = TRUE)
b_FGA <- max(input_2$FGA, na.rm = TRUE)
# Plot field goals made against field goal attempts
plot(x = input_2$FGA, y = input_2$FG,
xlab = "Field Goal Attempts",
ylab = "Field Goals Made",
xlim = c(0.0, b_FGA),
ylim = c(0.0, b_FG),
main = "Field Goal Attempts vs Field Goals Made"
)

# Plot the mean points per game for each position
ggplot(nba_roster, aes(x = Pos, y = as.numeric(PTS))) +
stat_summary(fun = mean, geom = "bar") +
theme_classic(16) +
xlab("Position") +
ylab("Mean points per game")

ASSIGNMENT 1: Is the data “clean”? Are there any
missing values to be accounted for/addressed? If there are any data
quality issues,
- a. propose a method to resolve them
The stat columns are read in as character even though they hold
fractional values, so they should be converted from “character” to
“double”.
- b. justify the validity of your approach
Rows with missing data are removed with the function “na.omit”, which
drops every row that contains a missing value.
- c. implement your proposed changes
For players who are missing data
nba_roster<-na.omit(nba_roster)
nba_roster
# Convert the stat columns (G through PTS) from character to double
nba_roster <- nba_roster %>%
mutate(across(G:PTS, as.numeric))
To determine whether a player is “top tier” and should be considered
part of a “Big 3” lineup, other authors have transformed traditional
stats to create metrics such as
PRA = POINTS + REBOUNDS + ASSISTS
We will consider advanced statistics such as PLAYER EFFICIENCY
RATING, approximated here by the simple linear efficiency formula:
PER = (PTS + REB + AST + STL + BLK − Missed FG − Missed FT − TO)
/ GP
Note that the PER column published by Basketball Reference is computed
with John Hollinger's more involved formula.
In particular, Value Over Replacement Player (VORP) seems to do a solid
job of identifying the best players in the league.
The script below defines a function that pulls advanced stats for
every player in a user-specified season.
# Function to get NBA advanced stats for a specified year
get_nba_advanced_stats <- function(year) {
# Construct the URL for the specified year
url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
# Read the HTML content from the URL
webpage <- read_html(url)
# Extract the table containing the advanced player statistics
advanced_stats_table <- webpage %>%
html_node("table#advanced_stats") %>%
html_table(fill = TRUE)
# Clean the data (remove header rows that might be duplicated)
# advanced_stats_table <- advanced_stats_table %>%
# filter(Player != "Player")
return(advanced_stats_table)
}
# Example usage
year <- 2018 # Specify the year
nba_advanced_stats <- get_nba_advanced_stats(year)
# Print the first few rows of the advanced stats
head(nba_advanced_stats)
ASSIGNMENT 2: Is the advanced data “clean”? Are
there any missing values to be accounted for/addressed? If there are any
data quality issues,
a. propose a method to resolve them
b. justify the validity of your approach
c. implement your proposed changes
The cleaning is similar to that of the first dataset.
The script below cleans up the data quality issues present in the
data frame.
# `dataframe` stands for the table being cleaned; here it is the advanced stats
dataframe <- nba_advanced_stats
#1 Order the athletes' names alphabetically so the filler header rows are grouped together
newdataframe <- dataframe[order(dataframe$Player), ]
# Order alphabetically by player name so the filler header rows are easy to spot
AO_nba_advanced_stats <- nba_advanced_stats[order(nba_advanced_stats$Player),]
# Remove the filler rows that were used as repeated headers on the webpage
AO_nba_advanced_stats <- AO_nba_advanced_stats[AO_nba_advanced_stats$Player != "Player", ]
# Remove the spacer columns whose cells are all blank or NA
blank_cols <- sapply(AO_nba_advanced_stats, function(col) all(is.na(col) | col == ""))
AO_nba_advanced_stats <- AO_nba_advanced_stats[, !blank_cols]
# Change the stat columns (G through VORP) from <chr> to <dbl>
AO_nba_advanced_stats <- AO_nba_advanced_stats %>%
mutate(across(G:VORP, as.numeric))
Warning: There were 22 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(G:VORP, as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run dplyr::last_dplyr_warnings() to see the 21 remaining warnings.
ASSIGNMENT 3: Merge the cleaned up datasets to
create one new data frame with the traditional and advanced
stats.
# Merge the two data frames on the identifying columns they share
key_cols <- intersect(c("Rk", "Player", "Pos", "Age", "Tm", "G"),
intersect(names(nba_roster), names(AO_nba_advanced_stats)))
stopifnot(length(key_cols) > 0)
nba_merge <- merge(nba_roster, AO_nba_advanced_stats, by = key_cols, all = TRUE)
ASSIGNMENT 4: Make a function with argument
year that outputs one dataframe with the merged traditional
and advanced data.
combined_nba_stats <- function(year) {
# Traditional stats: get_nba_roster() already removes the repeated header rows
traditional <- get_nba_roster(year)
# Take out the N/A rows, then convert the stat columns from character to double
traditional <- na.omit(traditional)
traditional <- traditional %>%
mutate(across(G:PTS, as.numeric))
# Advanced stats: remove the filler header rows and the blank spacer columns
advanced <- get_nba_advanced_stats(year)
advanced <- advanced[advanced$Player != "Player", ]
blank_cols <- sapply(advanced, function(col) all(is.na(col) | col == ""))
advanced <- advanced[, !blank_cols]
advanced <- advanced %>%
mutate(across(G:VORP, as.numeric))
# Merge on the identifying columns that both tables share
key_cols <- intersect(c("Rk", "Player", "Pos", "Age", "Tm", "G"),
intersect(names(traditional), names(advanced)))
stopifnot(length(key_cols) > 0)
merge(traditional, advanced, by = key_cols, all = TRUE)
}
# Example usage
nba_merge_2023 <- combined_nba_stats(2023)
head(nba_merge_2023)
ASSIGNMENT 5: Make this file more visually
appealing, with headers, bullet points, sections and subsections as you
see fit. You may consider migrating over to Quarto for this
reason.
---
title: "R Notebook"
output: html_notebook
---

The goal of this project is to investigate how partnerships involving multiple top-tier NBA players affect various performance measures and team outcomes. Among the research questions we would like to explore are the following:

*[ INSERT RESEARCH QUESTIONS HERE ]*

To investigate these questions, we need to pull data from multiple NBA seasons. The script below defines a function that pulls traditional per-game stats for every player in a user-specified season.

Load necessary libraries
```{r}
library(rvest)
library(dplyr)
library(tidyverse)
```

Function to get NBA roster for a specified year
```{r}
get_nba_roster <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_per_game.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)


  # Extract the table containing the player statistics
  roster_table <- webpage %>%
    html_node("table#per_game_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
  roster_table <- roster_table %>%
    filter(Player != "Player")
    return(roster_table)
}
```

Example usage
```{r}
year <- 2018  # Specify the year
nba_roster <- get_nba_roster(year)

#Print the first few rows of the roster
head(nba_roster)
tail(nba_roster)

```
```{r}
# Players at every position except point guard

position_roster <- filter(nba_roster, Pos != "PG")
position_roster
```



```{r}

library(plotly)

# Assuming 'nba_roster' is your data frame
input <- nba_roster[, c('MP', 'PTS', 'Player','Pos')]

# Create the plotly scatter plot
fig <- plot_ly(input, x = ~MP, y = ~PTS, type = 'scatter', mode = 'markers',
               text = ~Player,  # This adds player names on hover
               hoverinfo = 'text', # Ensures that only player names appear on hover
               color = ~Pos,  # Colors points based on position
               marker = list(size = 10))

# Set the plot title and axis labels
fig <- fig %>% layout(title = "Minutes Played vs Points Scored",
                      xaxis = list(title = "Minutes Played", range = c(0, 48)),
                      yaxis = list(title = "Points", range = c(0, 35)))

# Show the plot
fig


```
```{r}
# Data visualization for minutes played vs points scored
input <- nba_roster[, c('MP', 'PTS')]
print(head(input))

# Plot points against minutes played, with the
# minutes axis limited to 0-48 and the
# points axis limited to 0-35.
plot(x = input$MP, y = input$PTS,
	xlab = "Minutes Played",
	ylab = "Points",
	xlim = c(0.0, 48),
	ylim = c(0.0, 35),	 
	main = "Minutes Played vs Points Scored"
)

```


```{r}
# Data visualization for field goals attempted vs field goals made
input_2 <- nba_roster[, c('FGA', 'FG')]

# Make sure both columns are numeric before computing limits and plotting
input_2$FGA <- as.numeric(input_2$FGA)
input_2$FG <- as.numeric(input_2$FG)
print(head(input_2))

# Use the largest observed values as the upper axis limits
b_FG <- max(input_2$FG, na.rm = TRUE)
b_FGA <- max(input_2$FGA, na.rm = TRUE)

# Plot field goals made against field goal attempts
plot(x = input_2$FGA, y = input_2$FG,
  xlab = "Field Goal Attempts",
  ylab = "Field Goals Made",
  xlim = c(0.0, b_FGA),
  ylim = c(0.0, b_FG),
  main = "Field Goal Attempts vs Field Goals Made"
)


```



```{r}
# Plot the mean points per game for each position
ggplot(nba_roster, aes(x = Pos, y = as.numeric(PTS))) +
  stat_summary(fun = mean, geom = "bar") +
  theme_classic(16) +
  xlab("Position") +
  ylab("Mean points per game")

```


**ASSIGNMENT 1:** *Is the data "clean"? Are there any missing values to be accounted for/addressed? If there are any data quality issues,*

 - *a. propose a method to resolve them*
       
The stat columns are read in as character even though they hold fractional values, so they should be converted from "character" to "double".


 - *b. justify the validity of your approach*

Rows with missing data are removed with the function `na.omit`, which drops every row that contains a missing value.


 - *c. implement your proposed changes*


For players who are missing data
```{r}
nba_roster<-na.omit(nba_roster)

nba_roster




# Convert the stat columns (G through PTS) from character to double
nba_roster <- nba_roster %>%
   mutate(across(G:PTS, as.numeric))



```
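
The order of the two cleaning steps matters, because `as.numeric()` turns blank strings into `NA`. The sketch below uses a small made-up table (the `toy_stats` name and all values are illustrative, not scraped data) to convert first, count the missing values per column, and only then drop incomplete rows.

```r
# Toy per-game table; every column is character, as it is after scraping.
# All values here are invented for illustration.
toy_stats <- data.frame(
  Player = c("Player A", "Player B", "Player C"),
  G      = c("82", "10", "5"),
  `3P%`  = c("0.401", "", "0.250"),
  PTS    = c("27.5", "3.1", "1.0"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert the stat columns to double; the blank string becomes NA
stat_cols <- c("G", "3P%", "PTS")
toy_stats[stat_cols] <- lapply(toy_stats[stat_cols], as.numeric)

# Count the missing values in each column before deciding how to handle them
na_counts <- colSums(is.na(toy_stats))
print(na_counts)

# Drop rows with any missing value (this removes Player B)
toy_clean <- na.omit(toy_stats)
print(toy_clean)
```

Dropping a row is only one option. A player with no three-point attempts has an undefined `3P%` but otherwise valid stats, so it may be preferable to keep such rows when the missing column is not needed.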

To determine whether a player is "top tier" and should be considered a part of a "Big 3" lineup, other authors have transformed traditional stats to create metrics such as

PRA = POINTS + REBOUNDS + ASSISTS 

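We will consider advanced statistics such as PLAYER EFFICIENCY RATING, approximated here by the simple linear efficiency formula:

PER = (PTS + REB + AST + STL + BLK − Missed FG − Missed FT − TO) / GP

Note that the PER column published by Basketball Reference is computed with John Hollinger's more involved formula.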
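
The two formulas above translate directly into column arithmetic. The following is a minimal sketch on invented season totals; the column names assume the Basketball Reference abbreviations (`TRB` for rebounds, `TOV` for turnovers), and the result is stored in a column called `EFF`, an illustrative name chosen to keep it distinct from the PER column that Basketball Reference publishes.

```r
# Season totals for two made-up players
toy_totals <- data.frame(
  Player = c("Player A", "Player B"),
  G   = c(80, 60),
  PTS = c(2000, 800), TRB = c(500, 300), AST = c(400, 100),
  STL = c(100, 40),   BLK = c(50, 20),
  FG  = c(700, 300),  FGA = c(1500, 700),
  FT  = c(400, 150),  FTA = c(500, 200),
  TOV = c(200, 90),
  stringsAsFactors = FALSE
)

# PRA = points + rebounds + assists
toy_totals$PRA <- with(toy_totals, PTS + TRB + AST)

# Efficiency per game, following the formula above:
# missed FG = FGA - FG, missed FT = FTA - FT
toy_totals$EFF <- with(
  toy_totals,
  (PTS + TRB + AST + STL + BLK - (FGA - FG) - (FTA - FT) - TOV) / G
)

print(toy_totals[, c("Player", "PRA", "EFF")])
```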

In particular, Value Over Replacement Player (VORP) seems to do a solid job of identifying the best players in the league.
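
One possible way to flag "Big 3" candidates is to keep the highest-VORP players on each team. The sketch below is only an illustration: `top_n_by_team` is a hypothetical helper, the cutoff of three players per team is an assumption, and the table is invented.

```r
# Hypothetical helper: keep the n highest-VORP players on each team
top_n_by_team <- function(stats, n = 3) {
  # Sort by team, then by VORP from highest to lowest
  ordered <- stats[order(stats$Tm, -stats$VORP), ]
  # Position of each player within their team after sorting
  rank_in_team <- ave(ordered$VORP, ordered$Tm, FUN = seq_along)
  ordered[rank_in_team <= n, ]
}

# Invented players, teams, and VORP values
toy_vorp <- data.frame(
  Player = paste("Player", LETTERS[1:6]),
  Tm     = c("AAA", "AAA", "AAA", "AAA", "BBB", "BBB"),
  VORP   = c(5.1, 0.3, 2.4, 3.8, 1.2, 6.0),
  stringsAsFactors = FALSE
)

top_players <- top_n_by_team(toy_vorp, n = 3)
print(top_players)
```

Players traded mid-season appear under more than one team code, so real data would need a rule for which row to keep before ranking.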

The script below defines a function that pulls advanced stats for every player in a user-specified season.
```{r}

# Function to get NBA advanced stats for a specified year
get_nba_advanced_stats <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)
  
  # Extract the table containing the advanced player statistics
  advanced_stats_table <- webpage %>%
    html_node("table#advanced_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
 # advanced_stats_table <- advanced_stats_table %>%
  #  filter(Player != "Player")
  
  return(advanced_stats_table)
}

# Example usage
year <- 2018  # Specify the year
nba_advanced_stats <- get_nba_advanced_stats(year)

# Print the first few rows of the advanced stats
head(nba_advanced_stats)


```

**ASSIGNMENT 2:** *Is the advanced data "clean"? Are there any missing values to be accounted for/addressed? If there are any data quality issues,*

 - *a. propose a method to resolve them*

 - *b. justify the validity of your approach*

 - *c. implement your proposed changes*
 
 The cleaning is similar to that of the first dataset.
 
 
 The script below cleans up the data quality issues present in the data frame.
 
 
 
```{r}
# `dataframe` stands for the table being cleaned; here it is the advanced stats
dataframe <- nba_advanced_stats

#1 Order the athletes' names alphabetically so the filler header rows are grouped together
newdataframe <- dataframe[order(dataframe$Player), ]

#2 Remove the filler rows that had been used as headers on the webpage
newdataframe <- newdataframe[newdataframe$Player != "Player", ]

#3 Remove the columns that contain only NAs
newdataframe <- newdataframe[, colSums(!is.na(newdataframe)) > 0]
```
 
 
 
```{r}

# Order alphabetically by player name so the filler header rows are easy to spot
AO_nba_advanced_stats <- nba_advanced_stats[order(nba_advanced_stats$Player),]

# Remove the filler rows that were used as repeated headers on the webpage
AO_nba_advanced_stats <- AO_nba_advanced_stats[AO_nba_advanced_stats$Player != "Player", ]

# Remove the spacer columns whose cells are all blank or NA
blank_cols <- sapply(AO_nba_advanced_stats, function(col) all(is.na(col) | col == ""))
AO_nba_advanced_stats <- AO_nba_advanced_stats[, !blank_cols]

# Change the stat columns (G through VORP) from <chr> to <dbl>
AO_nba_advanced_stats <- AO_nba_advanced_stats %>%
   mutate(across(G:VORP, as.numeric))




```
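
Filtering on the contents of a row or column, rather than on fixed positions, keeps the cleaning valid when the number of players or the column layout changes between seasons. A minimal sketch on a made-up table follows; `toy_adv` and the `spacer` column name are illustrative assumptions.

```r
# Toy version of a scraped table: one repeated header row and one blank
# spacer column (named "spacer" here only so the example is easy to read)
toy_adv <- data.frame(
  Player = c("Player A", "Player", "Player B"),
  PER    = c("25.1", "PER", "12.3"),
  spacer = c("", "", ""),
  VORP   = c("6.0", "VORP", "0.4"),
  stringsAsFactors = FALSE
)

# 1. Drop rows that repeat the header
toy_adv <- toy_adv[toy_adv$Player != "Player", ]

# 2. Drop columns whose cells are all blank or NA
is_blank <- sapply(toy_adv, function(col) all(is.na(col) | col == ""))
toy_adv <- toy_adv[, !is_blank]

# 3. Convert the stat columns to double
toy_adv[c("PER", "VORP")] <- lapply(toy_adv[c("PER", "VORP")], as.numeric)
str(toy_adv)
```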


**ASSIGNMENT 3:** *Merge the cleaned up datasets to create one new data frame with the traditional and advanced stats.*



```{r}
# Merge the two data frames on the identifying columns they share
key_cols <- intersect(c("Rk", "Player", "Pos", "Age", "Tm", "G"),
                      intersect(names(nba_roster), names(AO_nba_advanced_stats)))
stopifnot(length(key_cols) > 0)
nba_merge <- merge(nba_roster, AO_nba_advanced_stats, by = key_cols, all = TRUE)


head(nba_merge)
```
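
`merge()` fails when a name listed in `by` is missing from either table, and it adds suffixes to any shared column that is not a key. The sketch below illustrates both points on two invented tables; the table names, key choice, and suffix labels are assumptions made for the example.

```r
# Invented traditional and advanced tables that share Player, Tm, and MP
toy_trad <- data.frame(
  Player = c("Player A", "Player B", "Player C"),
  Tm     = c("AAA", "BBB", "CCC"),
  MP     = c(34.2, 20.5, 12.0),   # minutes per game
  PTS    = c(27.5, 9.8, 4.1),
  stringsAsFactors = FALSE
)
toy_advanced <- data.frame(
  Player = c("Player A", "Player B", "Player D"),
  Tm     = c("AAA", "BBB", "DDD"),
  MP     = c(2800, 1500, 300),    # total minutes
  VORP   = c(6.0, 0.4, -0.1),
  stringsAsFactors = FALSE
)

# Check that every key column exists in both tables before merging
key_cols <- c("Player", "Tm")
stopifnot(all(key_cols %in% names(toy_trad)),
          all(key_cols %in% names(toy_advanced)))

# all = TRUE keeps unmatched rows from both sides (a full outer join);
# suffixes label the shared non-key column MP
toy_merged <- merge(toy_trad, toy_advanced, by = key_cols, all = TRUE,
                    suffixes = c("_per_game", "_total"))
print(toy_merged)
```

Unmatched rows come back with `NA` in the columns from the other table, which is a quick way to spot players present in only one source.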


**ASSIGNMENT 4:** *Make a function with argument `year` that outputs one dataframe with the merged traditional and advanced data.* 

```{r}

combined_nba_stats <- function(year) {
  # Traditional stats: get_nba_roster() already removes the repeated header rows
  traditional <- get_nba_roster(year)

  # Take out the N/A rows, then convert the stat columns from character to double
  traditional <- na.omit(traditional)
  traditional <- traditional %>%
    mutate(across(G:PTS, as.numeric))

  # Advanced stats: remove the filler header rows and the blank spacer columns
  advanced <- get_nba_advanced_stats(year)
  advanced <- advanced[advanced$Player != "Player", ]
  blank_cols <- sapply(advanced, function(col) all(is.na(col) | col == ""))
  advanced <- advanced[, !blank_cols]
  advanced <- advanced %>%
    mutate(across(G:VORP, as.numeric))

  # Merge on the identifying columns that both tables share
  key_cols <- intersect(c("Rk", "Player", "Pos", "Age", "Tm", "G"),
                        intersect(names(traditional), names(advanced)))
  stopifnot(length(key_cols) > 0)
  merge(traditional, advanced, by = key_cols, all = TRUE)
}

# Example usage
nba_merge_2023 <- combined_nba_stats(2023)
head(nba_merge_2023)



```
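
Since the project needs several seasons, a function that takes `year` can be applied over a vector of years and the results stacked. The sketch below is one way to do that; `pull_seasons`, `fetch_fn`, and `pause` are hypothetical names, and `fake_fetch` stands in for a real scraper so the example runs offline. In the notebook, `fetch_fn` would be `combined_nba_stats`, assuming it returns the same columns for every season.

```r
# Hypothetical helper: fetch each season, tag it, and stack the results
pull_seasons <- function(years, fetch_fn, pause = 0) {
  season_tables <- lapply(years, function(yr) {
    season <- fetch_fn(yr)
    season$Season <- yr        # record which season each row came from
    Sys.sleep(pause)           # optional pause between requests
    season
  })
  # rbind requires every season to have the same columns
  do.call(rbind, season_tables)
}

# Stand-in for a real scraper (invented values)
fake_fetch <- function(year) {
  data.frame(Player = c("Player A", "Player B"),
             PTS = c(20, 10) + (year - 2018),
             stringsAsFactors = FALSE)
}

all_seasons <- pull_seasons(2018:2020, fake_fetch)
print(all_seasons)
```

A non-zero `pause` spaces out the requests when the fetch function hits a live website.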


**ASSIGNMENT 5:** *Make this file more visually appealing, with headers, bullet points, sections and subsections as you see fit. You may consider migrating over to Quarto for this reason.*


